
feat(cli): add Kaggle dataset integration and Croissant metadata parsing #11

Merged
VinciGit00 merged 1 commit into ScrapeGraphAI:main from closestfriend:feature/kaggle-integration on Feb 6, 2026

Conversation

@closestfriend
Contributor

Description

Add Kaggle dataset integration and Croissant (ML Commons) metadata parsing to streamline dataset-to-TOON workflows. This enables users to download Kaggle datasets and convert them to TOON format in a single command.

Features

New CLI flags:

  • --kaggle - Treat input as Kaggle dataset slug
  • --croissant - Parse input as Croissant JSON-LD metadata
  • --file / -f - Select specific file from multi-file datasets

Usage examples:

# Download Kaggle dataset and convert to TOON
toon username/dataset-name --kaggle --stats

# Select specific file from dataset
toon username/dataset-name --kaggle --file data.csv

# Parse Croissant metadata to see schema
toon metadata.json --croissant

New Python API:

import json
from toon import download_dataset, find_best_csv, csv_to_records, parse_croissant

# Download and process Kaggle dataset
files = download_dataset("username/dataset-name")
csv_file = find_best_csv(files)
records = csv_to_records(csv_file.read_text())

# Parse Croissant metadata
with open("metadata.json") as f:
    metadata = json.load(f)
info = parse_croissant(metadata)
print(info['schema'])

Implementation

New module toon/kaggle.py provides:

  • download_dataset() - Download Kaggle datasets via kaggle CLI
  • find_best_csv() - Heuristic selection of main data file
  • csv_to_records() - CSV to list[dict] conversion
  • parse_croissant() - Extract schema from Croissant JSON-LD
  • croissant_to_summary() - Generate human-readable summaries
  • is_kaggle_slug() - Detect Kaggle dataset slug format

All imports are optional; the CLI gracefully degrades if the kaggle package is not installed. A rough sketch of the slug detection listed above is shown below.
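
A minimal sketch of what is_kaggle_slug() might look like (the function name matches the module's API; the exact pattern used in toon/kaggle.py is an assumption):

import re

def is_kaggle_slug(text: str) -> bool:
    """Heuristic check: does text look like a Kaggle slug (username/dataset-name)?"""
    # Assumed pattern: exactly one slash, with alphanumeric/dash/underscore/dot segments.
    return bool(re.fullmatch(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+", text))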

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • All tests pass
  • Added 12 new tests for Kaggle integration
  • Tested manually with real Kaggle datasets

Checklist

  • Code follows the project's style guidelines
  • Self-review completed
  • Documentation updated (CLI help, docstrings)
  • No new warnings or errors introduced

Add new --kaggle and --croissant CLI flags for streamlined dataset workflows:

- `toon username/dataset --kaggle` downloads and converts Kaggle datasets to TOON
- `toon metadata.json --croissant` parses ML Commons Croissant metadata
- `--file` flag to select specific files from multi-file datasets
- Auto-detection of Kaggle slugs (username/dataset-name format)

New module toon/kaggle.py provides:
- download_dataset(): Download Kaggle datasets via kaggle CLI
- find_best_csv(): Heuristic selection of main data file
- csv_to_records(): CSV to list[dict] conversion
- parse_croissant(): Extract schema from Croissant JSON-LD
- croissant_to_summary(): Generate human-readable dataset summaries

All imports are optional; the CLI gracefully degrades if the kaggle
package is not installed.

Includes comprehensive test suite (12 tests, 100% pass).
Copilot AI left a comment

Pull request overview

Adds Kaggle dataset download support and Croissant (ML Commons) JSON-LD parsing to the TOON tooling so users can go from dataset metadata/slug to TOON output via the CLI (and via a small Python API surface).

Changes:

  • Introduces toon/kaggle.py with Kaggle CLI download utilities, CSV selection/conversion, and Croissant metadata parsing/summary helpers.
  • Extends toon CLI with --kaggle, --croissant, and --file/-f flows to download/parse and then encode to TOON.
  • Exposes Kaggle/Croissant helpers from toon/__init__.py and adds unit tests for the new module.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File                  Description
toon/kaggle.py        New Kaggle/Croissant helper module (download via kaggle CLI, CSV heuristics, Croissant schema extraction & summary).
toon/cli.py           Adds new CLI flags and execution paths for Kaggle downloads and Croissant parsing.
toon/__init__.py      Exports Kaggle/Croissant helpers as part of the public API (attempts "optional" gating).
tests/test_kaggle.py  Adds tests for slug detection, CSV conversion, Croissant parsing/summary, and CSV selection heuristic.


Comment on lines +14 to +26
try:
    from .kaggle import (
        is_kaggle_slug,
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
    )
    KAGGLE_AVAILABLE = True
except ImportError:
    KAGGLE_AVAILABLE = False

Copilot AI Feb 6, 2026

KAGGLE_AVAILABLE is set based on importing toon.kaggle, but that module has no external imports, so this will be True even when the Kaggle CLI isn’t installed/configured. This makes the if not KAGGLE_AVAILABLE: branches effectively dead code and the error message about needing the “kaggle package” misleading. Consider removing this import-gating entirely and instead detect the kaggle executable (e.g., via shutil.which('kaggle')) or rely on download_dataset() raising a clear error, and update the messaging to refer to the Kaggle CLI/credentials rather than the Python package.

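A minimal sketch of the executable-based check suggested here (not code from this PR; a possible alternative gate):

import shutil

def kaggle_cli_available() -> bool:
    """True if the `kaggle` executable is on PATH; credentials are not verified."""
    return shutil.which("kaggle") is not None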


# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):

Copilot AI Feb 6, 2026


This condition enables implicit Kaggle-slug auto-detection (treating args.input as Kaggle when it matches username/dataset), even if the user didn’t pass --kaggle. This behavior isn’t described in the PR description/CLI help and can change semantics for relative paths like data/user/file that don’t exist yet. Either require --kaggle explicitly or document the auto-detection behavior clearly (and consider making it opt-in).

Comment on lines +224 to +305
# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):
    if not KAGGLE_AVAILABLE:
        print('Error: Kaggle support requires the kaggle package. '
              'Install with: pip install kaggle', file=sys.stderr)
        return 1

    try:
        print(f'Downloading Kaggle dataset: {args.input}', file=sys.stderr)
        files = download_dataset(args.input)

        # Find the target file
        if args.select_file:
            target = next(
                (f for f in files if args.select_file in f.name),
                None
            )
            if not target:
                print(f'Error: No file matching "{args.select_file}" in dataset',
                      file=sys.stderr)
                print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
                return 1
        else:
            target = find_best_csv(files)
            if not target:
                # Try JSON files
                json_files = [f for f in files if f.suffix.lower() == '.json']
                target = json_files[0] if json_files else None

        if not target:
            print('Error: No CSV or JSON files found in dataset', file=sys.stderr)
            return 1

        print(f'Using: {target.name}', file=sys.stderr)

        # Read and convert
        content = target.read_text(encoding='utf-8', errors='replace')

        if target.suffix.lower() == '.csv':
            data = csv_to_records(content)
        else:
            data = json.loads(content)

        # Encode to TOON
        options = {
            'delimiter': args.delimiter,
            'indent': args.indent,
            'key_folding': args.key_folding,
        }
        if args.flatten_depth is not None:
            options['flatten_depth'] = args.flatten_depth

        output_content = encode(data, options)
        input_content = json.dumps(data)  # For stats comparison

        # Show statistics if requested
        if args.stats:
            input_tokens = count_tokens(input_content)
            output_tokens = count_tokens(output_content)

            print(f'Input (JSON): {len(input_content)} bytes', file=sys.stderr)
            print(f'Output (TOON): {len(output_content)} bytes', file=sys.stderr)
            if len(input_content) > 0:
                print(f'Size reduction: {(1 - len(output_content) / len(input_content)) * 100:.1f}%',
                      file=sys.stderr)

            if input_tokens is not None and output_tokens is not None:
                print(f'Input tokens: {input_tokens}', file=sys.stderr)
                print(f'Output tokens: {output_tokens}', file=sys.stderr)
                print(f'Token reduction: {(1 - output_tokens / input_tokens) * 100:.1f}%',
                      file=sys.stderr)
            else:
                print('(Install tiktoken for token statistics)', file=sys.stderr)

            print('---', file=sys.stderr)

        write_output(output_content, args.output)
        return 0

    except Exception as e:
        print(f'Error: {e}', file=sys.stderr)
        return 1

Copilot AI Feb 6, 2026


New Kaggle/Croissant CLI flows added here aren’t covered by tests. Since tests/test_cli.py already exists, it would be good to add tests for --kaggle/--croissant behavior (mocking subprocess.run / download_dataset, and verifying file selection + output) to prevent regressions.

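As a sketch of the kind of test this asks for (the patch target and the CLI entry-point signature are assumptions, not code from this PR):

from unittest.mock import patch

def test_kaggle_flag_converts_downloaded_csv(tmp_path):
    # Fake "downloaded" dataset: a single local CSV, no network access.
    csv_path = tmp_path / "data.csv"
    csv_path.write_text("id,name\n1,alice\n2,bob\n")

    with patch("toon.cli.download_dataset", return_value=[csv_path]):
        from toon.cli import main  # assumed entry point taking an argv list
        exit_code = main(["username/dataset-name", "--kaggle"])

    assert exit_code == 0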

import csv
import io
import json

Copilot AI Feb 6, 2026


import json is unused (only referenced in a docstring example). If the project enforces linting, this will fail; otherwise it’s still noise. Consider removing the import or using json in actual code.

Suggested change (delete this line):
import json

Comment on lines +71 to +73
if output_dir is None:
    output_dir = tempfile.mkdtemp(prefix="toon_kaggle_")


Copilot AI Feb 6, 2026


When output_dir is not provided, this creates a temp directory via mkdtemp() that is never cleaned up by the library or the CLI path (leaving behind toon_kaggle_* directories). Consider using tempfile.TemporaryDirectory() in the CLI flow (so it’s cleaned up), or documenting/returning the temp directory path and providing a cleanup option.

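One possible shape for the CLI-side cleanup, as a sketch that assumes download_dataset accepts the output_dir argument shown in the snippet above:

import tempfile

# Sketch: the downloaded files live in a temp directory that is removed on exit.
with tempfile.TemporaryDirectory(prefix="toon_kaggle_") as tmp_dir:
    files = download_dataset(args.input, output_dir=tmp_dir)
    # ... pick a file, read it, and encode to TOON before the directory is deleted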
Comment on lines +237 to +245
target = next(
    (f for f in files if args.select_file in f.name),
    None
)
if not target:
    print(f'Error: No file matching "{args.select_file}" in dataset',
          file=sys.stderr)
    print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
    return 1

Copilot AI Feb 6, 2026


File selection uses substring matching (args.select_file in f.name) and returns the first match from an arbitrary rglob() ordering. This can select the wrong file when multiple names contain the substring (e.g., data.csv vs mydata.csv). Prefer exact matches first (e.g., f.name == args.select_file), and if multiple matches exist, surface an error listing candidates.

Suggested change (replacing the quoted block above):
# Prefer exact filename matches first
exact_matches = [f for f in files if f.name == args.select_file]
if len(exact_matches) == 1:
    target = exact_matches[0]
elif len(exact_matches) > 1:
    print(
        f'Error: Multiple files named "{args.select_file}" found in dataset:',
        file=sys.stderr,
    )
    for f in exact_matches:
        print(f' - {f}', file=sys.stderr)
    return 1
else:
    # Fall back to substring matches, but require them to be unambiguous
    substring_matches = [f for f in files if args.select_file in f.name]
    if len(substring_matches) == 1:
        target = substring_matches[0]
    elif len(substring_matches) > 1:
        print(
            f'Error: Multiple files matching "{args.select_file}" in dataset:',
            file=sys.stderr,
        )
        for f in substring_matches:
            print(f' - {f}', file=sys.stderr)
        return 1
    else:
        print(
            f'Error: No file matching "{args.select_file}" in dataset',
            file=sys.stderr,
        )
        print(
            f'Available files: {[f.name for f in files]}',
            file=sys.stderr,
        )
        return 1

Comment on lines +214 to +218
data_types = field.get("dataType", ["unknown"])
type_str = data_types[0] if data_types else "unknown"
# Clean up schema.org prefixes
type_str = type_str.replace("sc:", "").replace("https://schema.org/", "")


Copilot AI Feb 6, 2026


field.get('dataType') is treated as a list (data_types[0]), but in JSON-LD it can also be a single string. If dataType is a string, this will take the first character (e.g., 'sc:Integer' -> 's') and produce an incorrect schema. Normalize dataType to a list (or handle str explicitly) before selecting the first type.

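A sketch of the normalization suggested here, accepting either a single string or a list for the JSON-LD dataType value (helper name is hypothetical):

def normalize_data_type(field: dict) -> str:
    """Return a cleaned type name whether JSON-LD dataType is a string or a list."""
    raw = field.get("dataType", ["unknown"])
    data_types = [raw] if isinstance(raw, str) else (raw or ["unknown"])
    # Strip schema.org prefixes, matching the cleanup in the quoted snippet.
    return data_types[0].replace("sc:", "").replace("https://schema.org/", "")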
Comment on lines +26 to +50
# Kaggle integration (optional - requires kaggle installation)
try:
    from .kaggle import (
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
        is_kaggle_slug,
    )
    _KAGGLE_AVAILABLE = True
except ImportError:
    _KAGGLE_AVAILABLE = False
    def download_dataset(*args, **kwargs):
        raise ImportError("download_dataset requires kaggle to be installed. Please install kaggle to use this feature.")
    def find_best_csv(*args, **kwargs):
        raise ImportError("find_best_csv requires kaggle to be installed. Please install kaggle to use this feature.")
    def csv_to_records(*args, **kwargs):
        raise ImportError("csv_to_records requires kaggle to be installed. Please install kaggle to use this feature.")
    def parse_croissant(*args, **kwargs):
        raise ImportError("parse_croissant requires kaggle to be installed. Please install kaggle to use this feature.")
    def croissant_to_summary(*args, **kwargs):
        raise ImportError("croissant_to_summary requires kaggle to be installed. Please install kaggle to use this feature.")
    def is_kaggle_slug(*args, **kwargs):
        raise ImportError("is_kaggle_slug requires kaggle to be installed. Please install kaggle to use this feature.")

Copilot AI Feb 6, 2026


The try/except ImportError gating here is misleading: toon.kaggle is part of this package and only uses stdlib imports, so this import will succeed regardless of whether the user has the Kaggle CLI installed. As a result, the fallback stubs will never be used, and the error messages about needing to “install kaggle” don’t reflect the actual runtime dependency (the kaggle executable + credentials). Consider removing the ImportError gating and documenting/checking for the Kaggle CLI instead, or explicitly gating on shutil.which('kaggle') if you want a true “availability” switch.

Comment on lines +309 to +310
if not KAGGLE_AVAILABLE:
    print('Error: Croissant support requires the kaggle module.', file=sys.stderr)

Copilot AI Feb 6, 2026


The --croissant path is gated on KAGGLE_AVAILABLE and the error message says it “requires the kaggle module”, but Croissant parsing here is pure-stdlib (json + parse_croissant) and shouldn’t require the Kaggle CLI. Consider decoupling Croissant support from any Kaggle availability checks and updating the message accordingly (or dropping the check entirely).

Suggested change (replacing the quoted block above):
# Croissant support depends on the availability of parse_croissant,
# but does not require the Kaggle CLI itself.
if 'parse_croissant' not in globals():
    print('Error: Croissant support is not available in this installation.', file=sys.stderr)

@@ -0,0 +1,190 @@
"""Tests for Kaggle integration module."""

import pytest

Copilot AI Feb 6, 2026


Import of 'pytest' is not used.

Suggested change (delete this line):
import pytest

@VinciGit00 (Contributor) left a comment


LGTM

VinciGit00 merged commit b9a5204 into ScrapeGraphAI:main on Feb 6, 2026
8 checks passed
@github-actions

github-actions bot commented Feb 6, 2026

🎉 This PR is included in version 1.6.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
